Mixed Mode Matrix Multiplication
Authors
Abstract
In modern clustering environments the memory hierarchy has many layers (distributed memory, shared memory, cache, and so on). An important question is how to fully utilize all available resources and how to identify the layer that dominates a given computation. When algorithms on all layers are combined, which method gets the best performance out of the available resources? Many researchers use a mixed mode programming model, which applies thread programming on the shared memory layer and message passing on the distributed memory layer. In this paper, we take an algorithmic approach that uses matrix multiplication as a tool to show how cache algorithms affect the performance of both shared memory and distributed memory algorithms. We show that with a good underlying cache algorithm, overall performance is stable. When the underlying cache algorithm is bad, superlinear speedup may occur, and increasing the number of threads may also improve performance.

1. Memory Hierarchies in Modern Clustering Environments

Figure 1 shows the memory hierarchy that exists in most nodes of modern clustering environments. Globally, many nodes are linked together by a high-speed network. Inside each node there may be many processors, and each processor's memory accesses go either to a high-speed memory unit, the cache, or to the lower-speed main memory. In our mixed-mode programming model we use the Message Passing Interface (MPI) for data communication between the global nodes. Inside each MPI process we have two choices. One is to use POSIX threads, so that one or many threads belong to the MPI process mapped to the node. The other is to use MPI again, with local processes mapped to all processors of the node. Inside each process we use different algorithms that utilize the cache. Another option for threads, not explored in this project, is the OpenMP standard.
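To make the contrast between a "bad" and a "good" cache algorithm concrete, here is a minimal sketch in Python. The text above does not name the cache algorithms that were used, so loop blocking (tiling) is assumed purely as one common cache-aware technique; the function names and the `block` parameter are hypothetical. Pure Python will not reproduce the cache effects of compiled code; the sketch only shows how the loop structure differs.

```python
def matmul_naive(a, b):
    """Textbook i-j-k triple loop: the inner loop walks b down a column,
    which gives poor spatial locality for row-major storage."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            total = 0.0
            for k in range(m):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def matmul_blocked(a, b, block=2):
    """Blocked (tiled) variant: work on block x block tiles so that, in a
    compiled implementation, each tile can stay in cache while it is reused.
    `block` is an illustrative tuning parameter, not a value from the paper."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # Multiply the (ii, kk) tile of a by the (kk, jj) tile of b
                # and accumulate into the (ii, jj) tile of c.
                for i in range(ii, min(ii + block, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + block, m)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

Both functions compute the same product; only the order in which elements are visited changes, and that order is the property a cache algorithm controls.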
Based on this programming model, we selected and implemented several matrix multiplication algorithms for each layer. Bova et al. [3] determined that, “On a 100-CPU machine, using 100 MPI workers to perform a 100-component harbor simulation is inefficient due to inappropriate load balance. It would be more efficient to have 25 MPI workers create four OpenMP threads for each assigned wave component.” In our experiments, we show that even in a perfectly load-balanced computation such as matrix multiplication, the overall mixed mode performance is highly affected by cache algorithms. Our testing platform is the IBM SP system at the National Energy Research Scientific Computing facility [1].
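The two-level split behind the mixed mode model (a few message passing workers, each running several threads) can be sketched as follows. This is an illustration under stated assumptions, not the implementation evaluated in the paper: the outer loop merely stands in for MPI ranks and runs sequentially, Python threads stand in for POSIX threads, and `split_range`, `multiply_rows`, and `mixed_mode_matmul` are hypothetical names. Because of Python's global interpreter lock, the sketch shows the row decomposition only and gives no real parallel speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start, stop, parts):
    """Split [start, stop) into `parts` contiguous, nearly equal sub-ranges."""
    size, extra = divmod(stop - start, parts)
    bounds, lo = [], start
    for part in range(parts):
        hi = lo + size + (1 if part < extra else 0)
        bounds.append((lo, hi))
        lo = hi
    return bounds


def multiply_rows(a, b, lo, hi):
    """Compute rows lo..hi-1 of the product a * b."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(lo, hi)
    ]


def mixed_mode_matmul(a, b, workers=2, threads_per_worker=2):
    """Two-level row decomposition: rows are first divided among `workers`
    (a stand-in for MPI processes, simulated here by a sequential loop),
    then each worker's band is divided among its threads."""
    c = []
    for w_lo, w_hi in split_range(0, len(a), workers):
        bands = split_range(w_lo, w_hi, threads_per_worker)
        with ThreadPoolExecutor(max_workers=threads_per_worker) as pool:
            futures = [pool.submit(multiply_rows, a, b, lo, hi) for lo, hi in bands]
            # Collect results in submission order so rows stay in order.
            for future in futures:
                c.extend(future.result())
    return c
```

With `workers=25` and `threads_per_worker=4`, the same function mirrors the 25-by-4 layout mentioned in the quotation from Bova et al., applied here to matrix rows rather than wave components.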